pip install sodapy
Collecting sodapy
Downloading sodapy-2.2.0-py2.py3-none-any.whl (15 kB)
Collecting requests>=2.28.1
Downloading requests-2.31.0-py3-none-any.whl (62 kB)
|████████████████████████████████| 62 kB 1.8 MB/s eta 0:00:01
Requirement already satisfied: urllib3<3,>=1.21.1 in /Users/aidan/opt/anaconda3/lib/python3.9/site-packages (from requests>=2.28.1->sodapy) (1.26.7)
Requirement already satisfied: idna<4,>=2.5 in /Users/aidan/opt/anaconda3/lib/python3.9/site-packages (from requests>=2.28.1->sodapy) (3.2)
Requirement already satisfied: certifi>=2017.4.17 in /Users/aidan/opt/anaconda3/lib/python3.9/site-packages (from requests>=2.28.1->sodapy) (2021.10.8)
Requirement already satisfied: charset-normalizer<4,>=2 in /Users/aidan/opt/anaconda3/lib/python3.9/site-packages (from requests>=2.28.1->sodapy) (2.0.4)
Installing collected packages: requests, sodapy
Attempting uninstall: requests
Found existing installation: requests 2.26.0
Uninstalling requests-2.26.0:
Successfully uninstalled requests-2.26.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
conda-repo-cli 1.0.4 requires pathlib, which is not installed.
anaconda-project 0.10.1 requires ruamel-yaml, which is not installed.
Successfully installed requests-2.31.0 sodapy-2.2.0
Note: you may need to restart the kernel to use updated packages.
import pandas as pd
from sodapy import Socrata
/Users/aidan/opt/anaconda3/lib/python3.9/site-packages/pandas/core/computation/expressions.py:21: UserWarning: Pandas requires version '2.8.0' or newer of 'numexpr' (version '2.7.3' currently installed). from pandas.core.computation.check import NUMEXPR_INSTALLED /Users/aidan/opt/anaconda3/lib/python3.9/site-packages/pandas/core/arrays/masked.py:62: UserWarning: Pandas requires version '1.3.4' or newer of 'bottleneck' (version '1.3.2' currently installed). from pandas.core import (
We're going to explore a large data set of traffic crashes to learn about what factors are connected with injuries. We will use data from the City of Chicago's open data portal. (This activity is derived from a blog post by Julia Silge.)
client = Socrata("data.cityofchicago.org", None)
results = client.get("85ca-t3if", where="CRASH_DATE > '2022-01-01'")
# Convert to pandas DataFrame
crash_raw = pd.DataFrame.from_records(results)
WARNING:root:Requests made without an app_token will be subject to strict throttling limits.
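The outputs further down report exactly 1,000 rows, all dated 2024, which suggests the request returned only a default-sized page of results rather than everything since 2022. A minimal sketch of one way to ask for a specific date range, assuming sodapy's `get` forwards the usual SoQL keyword arguments (`where`, `order`, `limit`). The helper `build_crash_query` and its default row cap are hypothetical and only illustrate the idea:

```python
def build_crash_query(start, end, limit=200_000):
    """Build keyword arguments for Socrata.get covering start <= crash_date < end.

    Hypothetical helper: the name and the default limit are illustrative only.
    """
    return {
        "where": f"crash_date >= '{start}' AND crash_date < '{end}'",
        "order": "crash_date",
        "limit": limit,
    }


query = build_crash_query("2022-01-01", "2024-01-01")
print(query["where"])

# Usage, assuming `client` is the Socrata client created above:
# results = client.get("85ca-t3if", **query)
```

A larger request takes longer and is more likely to hit the throttling mentioned in the warning, so an app token may be worth registering before raising the limit.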
crash_raw.columns
Index(['crash_record_id', 'crash_date', 'posted_speed_limit',
'traffic_control_device', 'device_condition', 'weather_condition',
'lighting_condition', 'first_crash_type', 'trafficway_type',
'alignment', 'roadway_surface_cond', 'road_defect', 'report_type',
'crash_type', 'hit_and_run_i', 'damage', 'date_police_notified',
'prim_contributory_cause', 'sec_contributory_cause', 'street_no',
'street_direction', 'street_name', 'beat_of_occurrence', 'num_units',
'most_severe_injury', 'injuries_total', 'injuries_fatal',
'injuries_incapacitating', 'injuries_non_incapacitating',
'injuries_reported_not_evident', 'injuries_no_indication',
'injuries_unknown', 'crash_hour', 'crash_day_of_week', 'crash_month',
'latitude', 'longitude', 'location', 'intersection_related_i',
'statements_taken_i', 'photos_taken_i', 'private_property_i',
'crash_date_est_i', 'dooring_i', 'work_zone_i', 'work_zone_type',
'workers_present_i'],
dtype='object')
This dataset is pretty crazy! Let's do some data munging to get it into a nicer form.
Create a variable injuries, which indicates whether the crash involved injuries or not.
# Convert the 'crash_date' column to datetime format
crash_raw['crash_date'] = pd.to_datetime(crash_raw['crash_date'])
crash_raw['injuries'] = (pd.to_numeric(crash_raw['injuries_total']) > 0)
crash = crash_raw[['crash_date', 'injuries', 'latitude', 'longitude']]
crash_raw.columns
Index(['crash_record_id', 'crash_date', 'posted_speed_limit',
'traffic_control_device', 'device_condition', 'weather_condition',
'lighting_condition', 'first_crash_type', 'trafficway_type',
'alignment', 'roadway_surface_cond', 'road_defect', 'report_type',
'crash_type', 'hit_and_run_i', 'damage', 'date_police_notified',
'prim_contributory_cause', 'sec_contributory_cause', 'street_no',
'street_direction', 'street_name', 'beat_of_occurrence', 'num_units',
'most_severe_injury', 'injuries_total', 'injuries_fatal',
'injuries_incapacitating', 'injuries_non_incapacitating',
'injuries_reported_not_evident', 'injuries_no_indication',
'injuries_unknown', 'crash_hour', 'crash_day_of_week', 'crash_month',
'latitude', 'longitude', 'location', 'intersection_related_i',
'statements_taken_i', 'photos_taken_i', 'private_property_i',
'crash_date_est_i', 'dooring_i', 'work_zone_i', 'work_zone_type',
'workers_present_i', 'injuries'],
dtype='object')
# Assuming 'crash_raw' is your original DataFrame
selected_columns = ['longitude', 'latitude', 'injuries', 'lighting_condition', 'weather_condition', 'crash_month', 'prim_contributory_cause', 'sec_contributory_cause','crash_date']
# Create a new DataFrame with only the selected columns
crash_filtered = crash_raw[selected_columns]
# Drop rows where 'latitude' or 'longitude' is NaN; copy so that adding columns later does not trigger a SettingWithCopyWarning
crash_filtered = crash_filtered.dropna(subset=['latitude', 'longitude']).copy()
Here are a few questions to get you started.
Take a look at crashes by latitude and longitude, colored by injuries. What do you notice?
What are the most common contributing factors to a crash?
How do crashes vary month by month? Compare crashes by month in 2022 to 2023.
Are crashes more likely to cause injuries when it is rainy and dark? Use the variables weather_condition and lighting_condition to explore.
Choose a question you want to explore, and create an appropriate visual.
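For the weather and lighting question, one possible starting point is to compare injury rates across the combinations of the two variables. The sketch below uses a small made-up frame (`toy`) in place of the real crash data; only the column names are taken from the dataset:

```python
import pandas as pd

# Synthetic stand-in for the crash data; the values are made up for illustration.
toy = pd.DataFrame({
    "weather_condition": ["RAIN", "RAIN", "CLEAR", "CLEAR", "CLEAR", "RAIN"],
    "lighting_condition": ["DARKNESS", "DARKNESS", "DAYLIGHT",
                           "DAYLIGHT", "DARKNESS", "DAYLIGHT"],
    "injuries": [True, False, False, False, True, False],
})

# The mean of a boolean column is the share of True values, i.e. the injury rate.
# Keeping the group size alongside it shows how much data each rate rests on.
injury_rate = (
    toy.groupby(["weather_condition", "lighting_condition"])["injuries"]
    .agg(rate="mean", crashes="size")
)
print(injury_rate)
```

Rates computed from only a handful of crashes are noisy, so the `crashes` column is worth reading next to each rate.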
crash_raw['injuries']
0 False
1 False
2 False
3 False
4 False
...
995 False
996 False
997 False
998 False
999 False
Name: injuries, Length: 1000, dtype: bool
crash_raw['longitude']
0 -87.775099001
1 -87.631851672
2 -87.633011933
3 -87.675292438
4 -87.753893377
...
995 -87.624252186
996 -87.767138108
997 -87.687248767
998 -87.76844737
999 -87.651162183
Name: longitude, Length: 1000, dtype: object
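The `dtype: object` above means the coordinates arrived as strings rather than numbers. A small sketch of converting such a column, using made-up values (`raw`) rather than the real data:

```python
import pandas as pd

# Made-up coordinate strings, including one missing and one malformed entry.
raw = pd.Series(["-87.775099001", "-87.631851672", None, "not available"])

# errors="coerce" turns anything that cannot be parsed into NaN instead of raising.
longitude = pd.to_numeric(raw, errors="coerce")

print(longitude.dtype)
print(longitude.isna().sum())
```

Compared with `astype(float)`, this does not stop on the first malformed value, at the cost of silently producing NaN, so counting the missing values afterwards is a reasonable check.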
import matplotlib.pyplot as plt
# Assuming 'latitude', 'longitude', and 'injuries' are columns in your DataFrame
latitudes = crash_filtered['latitude'].astype(float)
longitudes = crash_filtered['longitude'].astype(float)
injuries = crash_filtered['injuries']
# Convert injuries to numeric values (boolean to int)
injuries_numeric = injuries.astype(int)
# Create separate scatter plots for 'Injuries' and 'No Injuries'
plt.scatter(longitudes[injuries_numeric == 1], latitudes[injuries_numeric == 1], c='red', marker='o', label='Injuries', alpha=0.6)
plt.scatter(longitudes[injuries_numeric == 0], latitudes[injuries_numeric == 0], c='green', marker='o', label='No Injuries', alpha=0.6)
# Set labels and title
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.title('Chicago Car Crashes')
# Add legend
plt.legend()
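# Save the figure before showing it; in a notebook the figure is closed after plt.show(),
# so saving afterwards writes an empty image
plt.savefig('crash_data_visualization.jpg')
# Show the plot
plt.show()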
import matplotlib.pyplot as plt
import pandas as pd
# Assuming 'prim_contributory_cause' and 'sec_contributory_cause' are columns in your DataFrame
combined_causes = pd.concat([crash_filtered['prim_contributory_cause'], crash_filtered['sec_contributory_cause']])
# Filter out 'UNABLE TO DETERMINE' and 'NOT APPLICABLE'
filtered_causes = combined_causes[~combined_causes.isin(['UNABLE TO DETERMINE', 'NOT APPLICABLE'])]
# Count the occurrences of each cause
cause_counts = filtered_causes.value_counts()
# Create a bar chart
plt.bar(cause_counts.head(5).index, cause_counts.head(5), color='blue')
plt.xlabel('Contributory Causes')
plt.ylabel('Frequency')
plt.title('Top 5 Contributory Causes to Car Crashes')
plt.xticks(rotation=30) # Rotate x-axis labels for better visibility
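# Save the figure before showing it; in a notebook the figure is closed after plt.show(),
# so saving afterwards writes an empty image
plt.savefig('Causes_Ranked.jpg')
# Show the plot
plt.show()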
crash_filtered
| | longitude | latitude | injuries | lighting_condition | weather_condition | crash_month | prim_contributory_cause | sec_contributory_cause | crash_date |
|---|---|---|---|---|---|---|---|---|---|
| 0 | -87.775099001 | 41.909868519 | False | DARKNESS, LIGHTED ROAD | CLEAR | 2 | FAILING TO REDUCE SPEED TO AVOID CRASH | UNABLE TO DETERMINE | 2024-02-05 00:02:00 |
| 1 | -87.631851672 | 41.911257045 | False | DARKNESS, LIGHTED ROAD | CLEAR | 2 | DISREGARDING TRAFFIC SIGNALS | NOT APPLICABLE | 2024-02-04 22:27:00 |
| 2 | -87.633011933 | 41.912503239 | False | DARKNESS, LIGHTED ROAD | CLEAR | 2 | UNABLE TO DETERMINE | UNABLE TO DETERMINE | 2024-02-04 22:15:00 |
| 3 | -87.675292438 | 42.002600373 | False | DARKNESS, LIGHTED ROAD | CLEAR | 2 | UNDER THE INFLUENCE OF ALCOHOL/DRUGS (USE WHEN... | NOT APPLICABLE | 2024-02-04 22:00:00 |
| 4 | -87.753893377 | 41.925831886 | False | DARKNESS, LIGHTED ROAD | CLEAR | 2 | UNABLE TO DETERMINE | UNABLE TO DETERMINE | 2024-02-04 21:42:00 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | -87.624252186 | 41.743546758 | False | DAYLIGHT | CLEAR | 1 | NOT APPLICABLE | NOT APPLICABLE | 2024-01-31 16:12:00 |
| 996 | -87.767138108 | 41.882099191 | False | DAYLIGHT | CLEAR | 1 | OPERATING VEHICLE IN ERRATIC, RECKLESS, CARELE... | NOT APPLICABLE | 2024-01-31 16:00:00 |
| 997 | -87.687248767 | 41.917103631 | False | DAYLIGHT | CLEAR | 1 | DRIVING SKILLS/KNOWLEDGE/EXPERIENCE | DRIVING SKILLS/KNOWLEDGE/EXPERIENCE | 2024-01-31 15:54:00 |
| 998 | -87.76844737 | 41.943227521 | False | DAYLIGHT | CLEAR | 1 | FOLLOWING TOO CLOSELY | FOLLOWING TOO CLOSELY | 2024-01-31 15:50:00 |
| 999 | -87.651162183 | 41.869615798 | False | DAYLIGHT | CLEAR | 1 | FAILING TO YIELD RIGHT-OF-WAY | NOT APPLICABLE | 2024-01-31 15:49:00 |
989 rows × 9 columns
import pandas as pd
# Assuming 'crash_filtered' is your DataFrame with a 'crash_date' column
crash_filtered['year'] = crash_filtered['crash_date'].dt.year
# Now 'crash_filtered' has a new column 'year' containing the year of each crash
crash_filtered
| | longitude | latitude | injuries | lighting_condition | weather_condition | crash_month | prim_contributory_cause | sec_contributory_cause | crash_date | year |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -87.775099001 | 41.909868519 | False | DARKNESS, LIGHTED ROAD | CLEAR | 2 | FAILING TO REDUCE SPEED TO AVOID CRASH | UNABLE TO DETERMINE | 2024-02-05 00:02:00 | 2024 |
| 1 | -87.631851672 | 41.911257045 | False | DARKNESS, LIGHTED ROAD | CLEAR | 2 | DISREGARDING TRAFFIC SIGNALS | NOT APPLICABLE | 2024-02-04 22:27:00 | 2024 |
| 2 | -87.633011933 | 41.912503239 | False | DARKNESS, LIGHTED ROAD | CLEAR | 2 | UNABLE TO DETERMINE | UNABLE TO DETERMINE | 2024-02-04 22:15:00 | 2024 |
| 3 | -87.675292438 | 42.002600373 | False | DARKNESS, LIGHTED ROAD | CLEAR | 2 | UNDER THE INFLUENCE OF ALCOHOL/DRUGS (USE WHEN... | NOT APPLICABLE | 2024-02-04 22:00:00 | 2024 |
| 4 | -87.753893377 | 41.925831886 | False | DARKNESS, LIGHTED ROAD | CLEAR | 2 | UNABLE TO DETERMINE | UNABLE TO DETERMINE | 2024-02-04 21:42:00 | 2024 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | -87.624252186 | 41.743546758 | False | DAYLIGHT | CLEAR | 1 | NOT APPLICABLE | NOT APPLICABLE | 2024-01-31 16:12:00 | 2024 |
| 996 | -87.767138108 | 41.882099191 | False | DAYLIGHT | CLEAR | 1 | OPERATING VEHICLE IN ERRATIC, RECKLESS, CARELE... | NOT APPLICABLE | 2024-01-31 16:00:00 | 2024 |
| 997 | -87.687248767 | 41.917103631 | False | DAYLIGHT | CLEAR | 1 | DRIVING SKILLS/KNOWLEDGE/EXPERIENCE | DRIVING SKILLS/KNOWLEDGE/EXPERIENCE | 2024-01-31 15:54:00 | 2024 |
| 998 | -87.76844737 | 41.943227521 | False | DAYLIGHT | CLEAR | 1 | FOLLOWING TOO CLOSELY | FOLLOWING TOO CLOSELY | 2024-01-31 15:50:00 | 2024 |
| 999 | -87.651162183 | 41.869615798 | False | DAYLIGHT | CLEAR | 1 | FAILING TO YIELD RIGHT-OF-WAY | NOT APPLICABLE | 2024-01-31 15:49:00 | 2024 |
989 rows × 10 columns
import matplotlib.pyplot as plt
import pandas as pd
# Assuming 'crash_filtered' is your DataFrame with the 'year' and 'crash_month' columns
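# Group by month and year and count crashes: months become rows, years become columns.
# crash_month arrives as text, so convert it to int to sort 1..12 rather than '1', '10', '11', ...
crashes_by_year_month = crash_filtered.groupby([crash_filtered['crash_month'].astype(int), 'year']).size().unstack()
# Create a grouped bar chart: one bar per year within each month
crashes_by_year_month.plot(kind='bar', colormap='viridis')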
# Set labels and title
plt.xlabel('Month')
plt.ylabel('Number of Crashes')
plt.title('Number of Crashes by Month and Year')
# Show the plot
plt.show()
import pandas as pd
from scipy.stats import chi2_contingency
# Assuming 'crash_filtered' is your DataFrame with relevant columns
contingency_table = pd.crosstab(index=crash_filtered['weather_condition'],
columns=[crash_filtered['lighting_condition'], crash_filtered['injuries']],
margins=True, margins_name='Total', rownames=['Weather'], colnames=['Lighting', 'Injuries'])
# Perform chi-square test for independence
chi2, p, _, _ = chi2_contingency(contingency_table)
# Output results
print(f"Chi-square value: {chi2}")
print(f"P-value: {p}")
# Interpret the results based on the p-value
if p < 0.05:
    print("There is a significant association between weather_condition, lighting_condition, and injuries.")
else:
    print("There is no significant association between weather_condition, lighting_condition, and injuries.")
Chi-square value: 548.9353777929671
P-value: 4.151392192467228e-80
There is a significant association between weather_condition, lighting_condition, and injuries.
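One caveat when reading this result: `margins=True` appends a Total row and column to the table, and `chi2_contingency` treats them as extra categories. A sketch on made-up data (the `toy` frame is an assumption, not the crash data) shows how the totals change the shape of the table and the degrees of freedom of the test:

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Made-up data: two weather categories crossed with injured / not injured.
toy = pd.DataFrame({
    "weather_condition": ["RAIN"] * 40 + ["CLEAR"] * 60,
    "injuries": [True] * 15 + [False] * 25 + [True] * 10 + [False] * 50,
})

# Same cross-tabulation, once with and once without the Total row and column.
plain = pd.crosstab(toy["weather_condition"], toy["injuries"])
with_totals = pd.crosstab(toy["weather_condition"], toy["injuries"], margins=True)

_, p_plain, dof_plain, _ = chi2_contingency(plain)
_, p_totals, dof_totals, _ = chi2_contingency(with_totals)

print(plain.shape, dof_plain)
print(with_totals.shape, dof_totals)
```

Because the totals are sums of the other cells rather than independent counts, a test on the table without margins is usually the one to report.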
## What is the spatial distribution of different causes of accidents?
import matplotlib.pyplot as plt
import pandas as pd
# Assuming 'latitude', 'longitude', 'injuries', and 'prim_contributory_cause' are columns in your DataFrame
latitudes = crash_filtered['latitude'].astype(float)
longitudes = crash_filtered['longitude'].astype(float)
injuries = crash_filtered['injuries']
prim_contributory_cause = crash_filtered['prim_contributory_cause']
# Convert injuries to numeric values (boolean to int)
injuries_numeric = injuries.astype(int)
# Define colors based on primary contributory cause
color_mapping = {
'FAILING TO YIELD RIGHT-OF-WAY': 'red',
'WEATHER': 'blue',
'DRIVING SKILLS/KNOWLEDGE/EXPERIENCE': 'green',
'FOLLOWING TOO CLOSELY': 'yellow',# Add more colors if needed
}
colors = prim_contributory_cause.map(color_mapping).fillna('gray') # Use gray for unknown causes
# Create a scatter plot with colors based on primary contributory cause
plt.scatter(longitudes, latitudes, c=colors, marker='o', alpha=0.8)
# Set labels and title
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.title('Primary Causes of Crashes')
#legend_labels = [plt.Line2D([0], [0], marker='o', color='w', markerfacecolor=color, markersize=10) for color in color_mapping.values()]
#plt.legend(legend_labels, color_mapping.keys(), title='Contributory Causes', loc='best')
# Show the plot
plt.show()
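The scatter plot above has no legend, so the colors cannot be read back to causes. A sketch of one way to build legend entries by hand, assuming the same `color_mapping` as above plus a hypothetical `'OTHER'` entry for the gray points:

```python
import matplotlib
matplotlib.use("Agg")  # only needed when running outside a notebook
import matplotlib.pyplot as plt
from matplotlib.lines import Line2D

color_mapping = {
    "FAILING TO YIELD RIGHT-OF-WAY": "red",
    "WEATHER": "blue",
    "DRIVING SKILLS/KNOWLEDGE/EXPERIENCE": "green",
    "FOLLOWING TOO CLOSELY": "yellow",
}
# 'OTHER' is an illustrative label for every cause drawn in gray.
legend_colors = {**color_mapping, "OTHER": "gray"}

fig, ax = plt.subplots()
# One marker-only handle per cause; linestyle="" hides the connecting line.
handles = [
    Line2D([0], [0], marker="o", linestyle="", color=color, label=cause)
    for cause, color in legend_colors.items()
]
ax.legend(handles=handles, title="Contributory Causes", loc="best")
```

In the notebook, the same `handles` list could be passed to `plt.legend(handles=handles, ...)` just before `plt.show()`.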
import plotly.express as px
import pandas as pd
# Assuming 'latitude', 'longitude', 'injuries', and 'prim_contributory_cause' are columns in your DataFrame
latitudes = crash_filtered['latitude'].astype(float)
longitudes = crash_filtered['longitude'].astype(float)
injuries = crash_filtered['injuries']
prim_contributory_cause = crash_filtered['prim_contributory_cause']
# Convert injuries to numeric values (boolean to int)
injuries_numeric = injuries.astype(int)
# Define colors based on primary contributory cause
color_mapping = {
'FAILING TO YIELD RIGHT-OF-WAY': 'red',
'WEATHER': 'blue',
'DRIVING SKILLS/KNOWLEDGE/EXPERIENCE': 'green',
'FOLLOWING TOO CLOSELY': 'yellow', # Add more colors if needed
}
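# Label each crash with its primary cause; causes without an assigned color become 'OTHER'
cause_labels = prim_contributory_cause.where(prim_contributory_cause.isin(list(color_mapping)), 'OTHER')
# Create a DataFrame for Plotly
plotly_data = pd.DataFrame({'latitudes': latitudes, 'longitudes': longitudes, 'cause': cause_labels})
# Create an interactive map plot with colors based on primary contributory cause.
# color_discrete_map is keyed by the values of the color column, so that column must hold cause names;
# the 'open-street-map' style does not need a Mapbox access token.
fig = px.scatter_mapbox(plotly_data,
                        lat='latitudes',
                        lon='longitudes',
                        color='cause',
                        color_discrete_map={**color_mapping, 'OTHER': 'gray'},
                        mapbox_style='open-street-map',
                        zoom=10)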
# Set title
fig.update_layout(title='Primary Causes of Crashes')
# Show the interactive plot
fig.show()
pip install plotly
Collecting plotly
Downloading plotly-5.18.0-py3-none-any.whl (15.6 MB)
|████████████████████████████████| 15.6 MB 8.2 MB/s eta 0:00:01
Collecting tenacity>=6.2.0
Downloading tenacity-8.2.3-py3-none-any.whl (24 kB)
Requirement already satisfied: packaging in /Users/aidan/opt/anaconda3/lib/python3.9/site-packages (from plotly) (21.0)
Requirement already satisfied: pyparsing>=2.0.2 in /Users/aidan/opt/anaconda3/lib/python3.9/site-packages (from packaging->plotly) (3.0.4)
Installing collected packages: tenacity, plotly
Successfully installed plotly-5.18.0 tenacity-8.2.3
Note: you may need to restart the kernel to use updated packages.